
Implementing scikit-learn's pipelines

  • In the previous recipes, we showed all the steps required to build a machine learning model: starting with loading the data, splitting it into a training and a test set, imputing missing values, encoding categorical features, and, lastly, fitting an XGBoost classifier.
  • The process requires multiple steps to be executed in a certain order, which can become tricky when the pipeline is modified frequently along the way. That is why scikit-learn introduced Pipelines. By using Pipelines, we can sequentially apply a list of transformations to the data and then train a given estimator (model); a minimal sketch follows this list.
  • One important point to be aware of is that the intermediate steps of a Pipeline must have the fit and transform methods (the final estimator only needs the fit method). Using Pipelines has several benefits:
    • The flow is much easier to read and understand—the chain of operations to be executed on given columns is clear
    • The order of steps is enforced by the Pipeline
    • Increased reproducibility
  • In this recipe, we show how to create the entire project's pipeline, from loading the data to training the classifier.
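  • As a minimal sketch of the idea (the toy data, step names, and the choice of LogisticRegression below are illustrative only, not part of this recipe's pipeline), a Pipeline chains one or more transformers with a final estimator, so that a single fit/predict call runs the whole sequence:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # intermediate step: needs fit and transform
    ('clf', LogisticRegression())                 # final estimator: only needs fit
])

pipe.fit(X, y)          # fit_transform of the imputer, then fit of the classifier
print(pipe.predict(X))  # the imputer's transform, then the classifier's predict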

How to do it...

Execute the following steps to build the project's pipeline.

  1. Import the libraries and define the custom transformer classes:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score,
                             recall_score, f1_score)
from imblearn.metrics import specificity_score
from imblearn.over_sampling import RandomOverSampler
from xgboost import XGBClassifier


class OutlierRemover(BaseEstimator, TransformerMixin):
    '''Caps the values of each feature at mean +/- n_std standard deviations.'''

    def __init__(self, n_std=3):
        self.n_std = n_std

    def fit(self, X, y=None):
        if np.isnan(X).any(axis=None):
            raise ValueError('There are missing values in the array! Please remove them.')
        mean_vec = np.mean(X, axis=0)
        std_vec = np.std(X, axis=0)
        self.upper_band_ = mean_vec + self.n_std * std_vec
        self.lower_band_ = mean_vec - self.n_std * std_vec
        self.n_features_ = len(self.upper_band_)
        return self

    def transform(self, X, y=None):
        X_copy = pd.DataFrame(X.copy())
        upper_band = np.repeat(
            self.upper_band_.reshape(self.n_features_, -1),
            len(X_copy),
            axis=1
        ).transpose()
        lower_band = np.repeat(
            self.lower_band_.reshape(self.n_features_, -1),
            len(X_copy),
            axis=1
        ).transpose()
        X_copy[X_copy >= upper_band] = upper_band
        X_copy[X_copy <= lower_band] = lower_band
        return X_copy.values


class RandomOverSamplerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42):
        self.random_state = random_state
        self.ros = RandomOverSampler(random_state=self.random_state)

    def fit(self, X, y=None):
        # fit_resample requires both X and y; the resampled data is discarded here,
        # as a scikit-learn transformer cannot change the number of samples.
        _, _ = self.ros.fit_resample(X, y)
        return self

    def transform(self, X):
        # transform must return the same rows it receives, so this step acts
        # as a passthrough within the Pipeline.
        return X
  2. Load the data, separate the target, and create the stratified train-test split:
df = pd.read_csv('data.csv', low_memory=False, index_col=0, na_values='')
X = df.copy()
y = X.pop('target')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
  3. Store lists of numerical/categorical features:
num_features = X_train.select_dtypes(include='number').columns.to_list()
cat_features = X_train.select_dtypes(include='object').columns.to_list()
  4. Define the numerical Pipeline:
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),                 # fill missing values with the column median
    ('outliers', OutlierRemover()),                                # custom transformer capping outliers
    ('imbalance', RandomOverSamplerTransformer(random_state=42))   # custom oversampling step (a passthrough at transform time)
])
  5. Define the categorical Pipeline:
cat_list = [list(X_train[col].dropna().unique()) for col in cat_features]
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(categories=cat_list, sparse=False,
                             handle_unknown='error', drop='first')),
    ('imbalance', RandomOverSamplerTransformer(random_state=42))
])
  6. Define the column transformer object:
preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_pipeline, num_features),
    ('categorical', cat_pipeline, cat_features)],
    remainder='drop')
  7. Create a joint Pipeline:
xgb = XGBClassifier(random_state=42)
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', xgb)])
  8. Fit the Pipeline to the training data:
xgb_pipeline.fit(X_train, y_train)
  9. Evaluate the performance of the entire Pipeline:
y_pred = xgb_pipeline.predict(X_test)
y_pred_prob = xgb_pipeline.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
roc_score = roc_auc_score(y_test, y_pred_prob)  # ROC AUC uses the predicted probabilities of the positive class
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
specificity = specificity_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("ROC AUC Score:", roc_score)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Specificity Score:", specificity)

How it works...

  • In Step 1, we imported the required libraries and defined the two custom transformer classes (OutlierRemover and RandomOverSamplerTransformer). The list of imports can look a bit daunting, but that is because we need to combine multiple functions/classes used in the previous recipes.
  • In Step 2, we loaded the data from a CSV file, separated the target variable and created the stratified train-test split.
  • In Step 3, we created two lists containing the names of the numerical and categorical features; we will apply different transformations depending on the type of each feature. To select the appropriate columns, we used the select_dtypes method.
  • In Step 4, we started preparing the separate Pipelines. For the numerical features, we chained three steps: imputing the missing values with the column median, capping outliers with the custom OutlierRemover, and the RandomOverSamplerTransformer step. When creating the Pipeline, we provided a list of tuples containing the steps, each tuple consisting of the name of the step (for easier identification) and the class we wanted to use, for example, the SimpleImputer.
  • In Step 5, we prepared a similar Pipeline for the categorical features. This time, we chained the imputer (using the most frequent value), the one-hot encoder, and the same oversampling step. For the encoder, we also specified a list of lists called cat_list, in which we enumerated all the possible categories, based on X_train. We did so mostly for the sake of the next recipe, in which we introduce cross-validation, as it can happen that some of the random draws will not contain all the categories.
  • In Step 6, we defined the ColumnTransformer object, which we used to manipulate the data in the columns. Again, we passed a list of tuples, where each tuple contained a name, one of the Pipelines we defined before, and a list of columns to which the transformations should be applied. We also specified remainder='drop' to drop any extra columns to which no transformations were applied. In this case, the transformations were applied to all the features, so no columns were dropped. For a toy illustration of how the columns are routed to the two Pipelines, see the sketch after this list.
  • In Step 7, we once again used a Pipeline to chain the preprocessor (the previously defined ColumnTransformer object) with the XGBoost classifier (for comparability, we set the random state to 42).
  • The last two steps involved fitting the entire Pipeline to the training data and calculating several evaluation metrics (accuracy, ROC AUC, precision, recall, F1, and specificity) on the test set.
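  • For reference, below is a small, self-contained sketch (using a made-up two-column DataFrame rather than the recipe's data) of how select_dtypes picks the column lists and how the ColumnTransformer routes them into the two Pipelines:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_df = pd.DataFrame({'age': [25.0, np.nan, 40.0],
                       'city': ['Paris', 'London', np.nan]})

num_cols = toy_df.select_dtypes(include='number').columns.to_list()  # ['age']
cat_cols = toy_df.select_dtypes(include='object').columns.to_list()  # ['city']

toy_preprocessor = ColumnTransformer(transformers=[
    ('numerical', Pipeline([('imputer', SimpleImputer(strategy='median'))]), num_cols),
    ('categorical', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder())
    ]), cat_cols)],
    remainder='drop')

# Each sub-Pipeline sees only its own columns; the outputs are concatenated column-wise.
toy_result = toy_preprocessor.fit_transform(toy_df)
print(toy_result.shape)  # 3 rows: the imputed 'age' column plus the one-hot encoded 'city' columns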

There's more...

  • In this recipe, we showed how to create the entire pipeline for a data science project. However, there are many other transformations we can apply to data as preprocessing steps. Some of them include:

    • Scaling numerical features: In other words, changing the range of the features. Because different features are measured on different scales, they can introduce bias into the model. We should mostly be concerned with feature scaling when dealing with models that calculate some kind of distance between observations (such as k-nearest neighbors). In general, methods based on decision trees do not require any scaling. Some popular options from scikit-learn include StandardScaler and MinMaxScaler.
    • Discretizing continuous variables: We can transform a continuous variable (such as age) into a finite number of bins (such as below 25, between 25 and 50, and older than 50). When we want to create specific bins, we can use the pd.cut function, while pd.qcut is used for splitting based on quantiles. Both are shown in the short sketch below.
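    • As a quick, hypothetical illustration of both ideas (using a made-up age column rather than the recipe's data):
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ages = pd.Series([18, 23, 31, 45, 52, 67], name='age')

# scaling numerical features
print(StandardScaler().fit_transform(ages.to_frame()))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(ages.to_frame()))    # rescaled to the [0, 1] range

# discretizing a continuous variable into custom bins...
print(pd.cut(ages, bins=[0, 25, 50, np.inf],
             labels=['below 25', '25-50', 'older than 50']))
# ...or into quantile-based bins
print(pd.qcut(ages, q=3, labels=['low', 'mid', 'high']))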
  • Turning to the custom OutlierRemover class defined in Step 1: we designed it similarly to the transformers in scikit-learn, meaning we can fit it on the training set and only apply the transformation to the test set.

  • In the __init__ method, we stored the number of standard deviations that determines whether an observation will be treated as an outlier (the default is 3). In the fit method, we computed and stored the upper and lower thresholds for being considered an outlier, as well as the number of features. In the transform method, we capped all the values according to the 3σ rule (see the toy example below).

  • One known limitation of this class is that it does not handle missing values, which is why we raise a ValueError when there are any in the input. In the Pipeline, we use the OutlierRemover after the imputation in order to avoid this issue. We could, of course, account for the missing values in the transformer; however, this would make the code longer and less readable. Please refer to the definition of SimpleImputer in scikit-learn for an example of how to mask missing values while building transformers.
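  • To make this concrete, here is a small, hypothetical check of the class on made-up data containing one obvious outlier (it assumes the OutlierRemover definition from Step 1 has been run):
import numpy as np

rng = np.random.default_rng(42)
toy_X = np.append(rng.normal(0, 1, 100), 50).reshape(-1, 1)  # 100 typical values plus one extreme one

outlier_remover = OutlierRemover(n_std=3)
capped = outlier_remover.fit_transform(toy_X)

print(toy_X.max())                  # ~50, the raw outlier
print(outlier_remover.upper_band_)  # mean + 3 * std of the column
print(capped.max())                 # the outlier has been capped at the upper band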